Image-based AI diagnostic performance for fatty liver: a systematic review and meta-analysis

Background The gold standard for diagnosing fatty liver is pathology. Recently, image-based artificial intelligence (AI) has been reported to have high diagnostic performance. We systematically reviewed studies of image-based AI in the diagnosis of fatty liver. Methods We searched the Cochrane Library, PubMed, and Embase, and assessed the quality of included studies with QUADAS-AI. The pooled sensitivity, specificity, negative likelihood ratio (NLR), positive likelihood ratio (PLR), and diagnostic odds ratio (DOR) were calculated using a random effects model. Summary receiver operating characteristic (SROC) curves were generated to assess the diagnostic accuracy of AI models. Results Fifteen studies were included in our meta-analysis. Pooled sensitivity and specificity were 92% (95% CI: 90–93%) and 94% (95% CI: 93–96%), PLR and NLR were 12.67 (95% CI: 7.65–20.98) and 0.09 (95% CI: 0.06–0.13), and DOR was 182.36 (95% CI: 94.85–350.61). In subgroup analyses by AI algorithm (conventional machine learning/deep learning), region, reference (US, MRI or pathology), imaging technique (MRI or US) and transfer learning, the models also demonstrated acceptable diagnostic efficacy. Conclusion AI has satisfactory performance in the diagnosis of fatty liver by medical imaging. The integration of AI into imaging devices may produce effective diagnostic tools, but more high-quality studies are needed for further evaluation. Supplementary Information The online version contains supplementary material available at 10.1186/s12880-023-01172-6.

Background
Fatty liver disease has become increasingly prevalent in recent years [1] and is now the most common chronic liver disease in the world. Fatty liver can lead to steatohepatitis, liver fibrosis, cirrhosis, and even hepatocellular carcinoma, whereas early detection and treatment may stop or even reverse its progression [2]. The best reference for the diagnosis and classification of hepatic steatosis is liver biopsy [3]. Nevertheless, its high cost [4], sampling errors [5,6], and procedure-related morbidity and mortality [7] make it unsuitable for screening. Noninvasive diagnostic tools to assess hepatic steatosis are therefore urgently needed.
Imaging is a useful tool to support decisions on diagnosis, staging, and treatment in clinical practice. Currently, the main imaging modalities for diagnosing fatty liver are magnetic resonance imaging (MRI), ultrasound (US), and computed tomography (CT). Conventional US is cheap, safe, and non-invasive, so it is the most commonly used modality for clinical screening [8]. However, the diagnostic accuracy of US depends largely on the operator's judgment, which may be influenced by many factors. CT can effectively detect fatty liver without the influence of abdominal fat. However, it exposes patients to ionizing radiation and is expensive; in addition, the classification of fatty liver by CT value may be too coarse. MRI has high soft tissue resolution and can quantify intrahepatic fat at the molecular level, so it is the main modality for the non-invasive quantification of hepatic steatosis [9]. However, its high cost and operational complexity may limit its clinical application. In institutions with limited medical resources, the lack of imaging equipment and experts makes it challenging to obtain an accurate and immediate diagnosis through medical imaging [10].
Artificial intelligence (AI), including conventional machine learning (ML) and deep learning (DL), has advanced significantly in the 21st century, especially in medical imaging diagnosis [11]. In medical imaging, a large number of quantitative features can be extracted from radiological images using sophisticated image processing techniques and subsequently analyzed by traditional biostatistical or AI models to diagnose disease or assess therapeutic responses. Several AI-assisted diagnostic models have been developed for fatty liver. For example, Han et al. [12] developed a classifier for the diagnosis of nonalcoholic fatty liver disease (NAFLD) that achieved a sensitivity of 97% and a specificity of 94%. The model was established by DL using US radio frequency (RF) data with reference to MRI-derived proton density fat fraction (PDFF). Many researchers are trying to improve the diagnostic efficacy of AI models by optimizing image quality, expanding sample sizes, and modifying algorithms.
To date, few meta-analyses have evaluated the diagnostic performance of image-based AI in this setting. This study aimed to perform a systematic review and meta-analysis to assess the performance of image-based AI in the diagnosis of fatty liver.

Protocol registration and study design
The study was registered in PROSPERO (CRD42023388607). The meta-analysis followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guideline [13].

Search strategy
We searched Embase, PubMed, and the Cochrane Library for studies of image-based AI in fatty liver published up to December 24, 2022. The search terms were as follows: "artificial intelligence", "deep learning", "machine learning", "fatty liver", "NAFLD", "non-alcoholic fatty liver disease", "steatohepatitis", "metabolic dysfunction-associated fatty liver disease" and "diagnosis, computer-assisted". The detailed search strategies for each database are summarised in Table S1.

Inclusion and exclusion criteria
We included all articles that used AI in the imaging diagnosis of hepatic steatosis. The inclusion criteria were: (1) participants underwent fatty liver-related imaging; (2) reference standards were clearly described. The exclusion criteria were: (1) duplicate publications; (2) non-English articles; (3) reviews, meta-analyses, comments, editorials, guidelines, and conference abstracts; (4) non-human samples; (5) studies that used pathological images, combined images with non-image information, or did not use AI models; (6) studies without enough information to calculate true positive (TP), true negative (TN), false positive (FP), and false negative (FN) values. Two reviewers (L-YD and Z-Q) independently screened the titles and abstracts according to the eligibility criteria and subsequently downloaded and reviewed the full texts.

Data extraction
Two authors (L-YD and Z-Q) conducted the data extraction independently. Any disagreements about the data were resolved with a third author (Y-XJ). The extracted data included authors, years, countries, study design, eligibility criteria, age, sample size, data source and range, imaging technique, reference, AI algorithm, and the TP, FP, TN, and FN values, which were used to calculate sensitivity and specificity. For studies that developed more than one AI model, we selected the one with the best overall performance for analysis.
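As a minimal illustration of how the extracted 2×2 counts translate into accuracy measures, the sketch below computes sensitivity and specificity from TP, FP, TN, and FN. The function name and the counts are hypothetical and are not taken from any included study.

```python
def sensitivity_specificity(tp, fp, tn, fn):
    """Return (sensitivity, specificity) from the four cells of a 2x2 table."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one diseased and one non-diseased case")
    sensitivity = tp / (tp + fn)  # proportion of diseased cases correctly detected
    specificity = tn / (tn + fp)  # proportion of non-diseased cases correctly ruled out
    return sensitivity, specificity


# Hypothetical counts, for illustration only
sens, spec = sensitivity_specificity(tp=90, fp=5, tn=95, fn=10)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
# prints: sensitivity=0.90, specificity=0.95
```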

Quality assessment
Two independent evaluators (L-YD and Z-Q) assessed the quality of all selected studies using the Quality Assessment of Diagnostic Accuracy Studies-AI (QUADAS-AI) criteria [14]. The tool includes four domains for risk of bias and three domains for applicability (Table S2). This new tool, a combination of QUADAS-2 [15] and QUADAS-C [16], was specifically designed to assess the risk of bias and the applicability of AI-associated studies. All disagreements were resolved through discussion with a third collaborator (Y-XJ).

Statistical analysis
The quality of all selected studies was assessed in RevMan using QUADAS-AI, the risk of publication bias was assessed in Stata (version 17.0), and all other statistical analyses were conducted in Meta-DiSc (version 1.4). Spearman's correlation coefficient between the log of sensitivity and the log of (1-specificity) was calculated to test for threshold effects, and heterogeneity was tested using the I² statistic. A random effects model was used to calculate pooled sensitivities, specificities, negative likelihood ratios (NLR), positive likelihood ratios (PLR), diagnostic odds ratios (DOR), and their 95% confidence intervals (CI) based on the crude TP, TN, FP, and FN values of each study. Summary receiver operating characteristic (SROC) curves were fitted to assess the accuracy of the AI models. Low, medium, and high accuracy were defined as area under the curve (AUC) values of 0.5–0.7, 0.7–0.9, and 0.9–1, respectively [17]. Subgroup analyses were then performed by: (1) AI algorithm (conventional ML or DL); (2) region; (3) whether transfer learning was performed; (4) reference (US, MRI or pathology); (5) imaging technique (conventional US, elastography or MRI). The risk of publication bias was assessed by Deeks' funnel plots. Fagan plots were drawn to calculate the pre-test and post-test probabilities to evaluate the clinical value. P-values < 0.05 were considered statistically significant.
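The measures named above can be sketched from per-study 2×2 counts as follows. This is a minimal sketch under stated assumptions, not a reproduction of the Meta-DiSc computations: it pools the log DOR with the DerSimonian-Laird random effects estimator (one common choice) and derives I² from Cochran's Q. The function names, the 0.5 continuity correction, and the example counts are all assumptions for illustration.

```python
import math


def study_measures(tp, fp, tn, fn, correction=0.5):
    """Return (PLR, NLR, DOR, variance of log DOR) for one study."""
    # Assumed convention: add a continuity correction when any cell is zero
    if 0 in (tp, fp, tn, fn):
        tp, fp, tn, fn = (x + correction for x in (tp, fp, tn, fn))
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    plr = sens / (1 - spec)        # positive likelihood ratio
    nlr = (1 - sens) / spec        # negative likelihood ratio
    dor = (tp * tn) / (fp * fn)    # diagnostic odds ratio, equal to PLR / NLR
    var_log_dor = 1 / tp + 1 / fp + 1 / tn + 1 / fn
    return plr, nlr, dor, var_log_dor


def pool_log_dor(studies):
    """DerSimonian-Laird pooling of log DOR; returns (pooled DOR, I^2 in percent)."""
    effects, variances = [], []
    for counts in studies:
        _, _, dor, var = study_measures(*counts)
        effects.append(math.log(dor))
        variances.append(var)
    w = [1 / v for v in variances]  # fixed-effect (inverse-variance) weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = len(studies) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-study variance
    w_re = [1 / (v + tau2) for v in variances]       # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return math.exp(pooled), i2


# Hypothetical (TP, FP, TN, FN) counts for three studies, for illustration only
studies = [(90, 5, 95, 10), (80, 10, 90, 20), (45, 2, 48, 5)]
pooled_dor, i2 = pool_log_dor(studies)
print(f"pooled DOR={pooled_dor:.1f}, I2={i2:.1f}%")
```

Pooling on the log scale keeps the DOR positive after back-transformation, and I² expresses the share of total variability attributed to between-study heterogeneity rather than chance.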

Quality assessment
The detailed results of the quality assessment of included studies are presented in Figure S1. A risk of bias was identified in more than half of the studies for patient selection (n = 8) and the index test (n = 15), owing to the lack of detailed descriptions of included patients and of appropriate external validation.

Clinical value and publication bias
The post-test probability of image-based AI for the diagnosis of hepatic steatosis was 94%, much higher than the pre-test probability (50%), indicating that image-based AI is clinically useful for the diagnosis of hepatic steatosis (Fig. 3a). The Deeks' funnel plot revealed no obvious publication bias among the included studies (P = 0.38) (Fig. 3b).
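The relation a Fagan plot encodes is Bayes' theorem on the odds scale: post-test odds equal pre-test odds multiplied by the likelihood ratio. The sketch below shows that conversion; the function name and the likelihood ratio of 9 are illustrative assumptions and are not the estimates behind Fig. 3a.

```python
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Convert a pre-test probability to a post-test probability via odds."""
    if not 0 < pre_test_prob < 1:
        raise ValueError("pre-test probability must lie strictly between 0 and 1")
    if likelihood_ratio < 0:
        raise ValueError("likelihood ratio must be non-negative")
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Illustrative values only: a 50% pre-test probability and a positive LR of 9
p = post_test_probability(0.5, 9)
print(f"post-test probability={p:.2f}")
# prints: post-test probability=0.90
```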

Discussion
AI has been widely used in medical imaging in recent years, and a growing number of AI models have been established to diagnose various liver diseases [32,33]. We conducted an extensive literature search in medical databases; the retrieved studies were carefully screened and critically assessed with QUADAS-AI. Ultimately, we found that AI models performed well in identifying liver steatosis by medical imaging.
AI aims to simulate, extend and expand human intelligence [34]. Conventional ML is one approach to achieving AI; it uses features extracted from various kinds of data to build prediction models through different algorithms. However, it requires manual feature extraction [35], and its ability to learn from data is limited [36]. DL is an advanced form of ML that uses multiple layers of deep neural networks to gain a deeper understanding of the data [37]. However, DL is prone to overfitting and usually requires more data [38]. Our subgroup analysis of the different algorithms showed that the sensitivity was higher in conventional ML, but the specificity, PLR, DOR and SROC were higher in DL. These results reveal the potential advantages of DL in the image-based diagnosis of liver steatosis.
Machine learning is commonly employed in biomedical fields. However, the application of advanced machine learning algorithms in clinical settings is limited by insufficient labeled data. Collecting labeled data is time-consuming, labor-intensive, and requires professional expertise. To address this problem, transfer learning can transfer the knowledge and models acquired in one task to another related task, leading to enhanced performance and generalization on the target task [39]. For instance, a recent study utilized transfer learning to diagnose coronavirus disease 2019 (COVID-19) automatically from CT images with a remarkable accuracy of 99.60% [40]. In our subgroup analysis, we found that transfer learning led to higher specificity, PLR, and DOR, which highlights the significance of transfer learning in image-based AI diagnosis of hepatic steatosis. However, only two studies used transfer learning, so further studies are needed to confirm its effectiveness.
The gold standard for the diagnosis of hepatic steatosis is pathology, but liver biopsy is subject to diagnostic errors because of sampling limitations. The EASL Clinical Practice Guideline [41] stated that MRI-PDFF was the most accurate non-invasive method for detecting and quantifying steatosis. Articles that used expert-interpreted US or MRI-PDFF as the reference were therefore also included in our study. We further conducted a subgroup analysis of the different reference standards.
For the imaging technique, conventional US, elastography and MRI were included in the selected studies. The subgroup analysis of different imaging techniques showed that the sensitivity and DOR were higher in conventional US than in elastography and MRI, which suggests that AI may be more useful in conventional US. However, the studies on elastography and MRI were too few, and the sources of the data differed. Further studies are needed to explore the AI-assisted efficacy of different imaging techniques.
Additionally, in our subgroup analysis of different regions, we found that the sensitivity, DOR and SROC were higher in Asia, whereas the specificity and PLR were higher in Europe and America, which suggests that the region of the included population may influence the diagnostic efficacy of AI for hepatic steatosis. Most of the studies we included were based on US images, which are susceptible to body size and visceral fat [42]. Western populations tend to have a larger body size than Asian populations, with greater differences between populations, which may affect the accuracy of AI diagnosis. In the future, accurate descriptions of characteristics such as body size and visceral fat of the included populations will be needed, so that their potential influence on the diagnostic efficacy of AI for hepatic steatosis can be explored.
Our study has several strengths. It shows the high efficacy of image-based AI in diagnosing hepatic steatosis, with no obvious publication bias, and may provide a reference for future clinical practice. Compared with the previous systematic review of AI-assisted diagnosis in NAFLD [43], our study mainly explores the diagnostic performance of image-based AI for liver steatosis rather than fibrosis. The number of included studies (15) was larger, and the subgroup analyses of imaging technique, AI algorithm, region and other factors may be helpful for future research. In addition, we employed the new QUADAS-AI tool, which addresses AI-specific methodology. In the past, the most frequently used quality assessment tool for diagnostic meta-analyses was QUADAS-2. However, it does not address AI-specific methodology, such as generalizability and diversity in patient selection, development of training, validation and testing datasets, and the definition and evaluation of an appropriate reference standard [44]. QUADAS-AI [14] is an AI-specific extension of QUADAS-2 and QUADAS-C that includes four domains for risk of bias and three domains for applicability concerns, making it more comprehensive and suitable for AI-associated studies. Some studies [45–47] of AI models have also employed this new tool.
However, our study has some limitations. Firstly, most of the studies were retrospective and did not clearly describe the participants, making it difficult to control for many confounding factors. Secondly, none of the included studies underwent suitable external validation, so whether the models can be applied to other populations requires further validation. Finally, there was heterogeneity in our meta-analysis; no significant threshold effects were found according to Spearman's correlation coefficient, and the heterogeneity was reduced within the subgroups, suggesting that the subgroup factors may be potential sources of heterogeneity. In the future, we hope that more prospective AI studies with external validation and large sample sizes will accurately assess the performance of image-based AI in diagnosing liver steatosis.

Conclusion
This meta-analysis suggests that AI has great potential for image-based diagnosis of hepatic steatosis. The integration of AI into imaging devices may produce effective diagnostic tools, but more high-quality studies are needed for sufficient validation.

Fig. 1 The flow of searching and selecting articles

Table 1 Characteristics of studies